Hardware-software co-design of an open-source automatic multimodal whole slide histopathology imaging system

Abstract. Significance Advanced digital control of microscopes and programmable data acquisition workflows have become increasingly important for improving the throughput and reproducibility of optical imaging experiments. Combinations of imaging modalities have enabled a more comprehensive understanding of tissue biology and tumor microenvironments in histopathological studies. However, insufficient imaging throughput and complicated workflows still limit the scalability of multimodal histopathology imaging. Aim We present a hardware-software co-design of a whole slide scanning system for high-throughput multimodal tissue imaging, including brightfield (BF) and laser scanning microscopy. Approach The system can automatically detect regions of interest using deep neural networks in a low-magnification rapid BF scan of the tissue slide and then conduct high-resolution BF scanning and laser scanning imaging on targeted regions with deep learning-based run-time denoising and resolution enhancement. The acquisition workflow is built using Pycro-Manager, a Python package that bridges hardware control libraries of the Java-based open-source microscopy software Micro-Manager in a Python environment. Results The system can achieve optimized imaging settings for both modalities with minimized human intervention and speed up the laser scanning by an order of magnitude with run-time image processing. Conclusions The system integrates the acquisition pipeline and data analysis pipeline into a single workflow that improves the throughput and reproducibility of multimodal histopathological imaging.

standard for the assessment of disease. The development of digital slide scanners and whole slide image (WSI) scanning techniques has enabled conventional hematoxylin and eosin (H&E)stained glass slides to be converted into digital images with microscopic resolutions for diagnosis, research, and medical education. 1,2 Consequently, digital pathology has become a popular research area in which innovation is sought in the analysis of digital images and the development of computational instruments to increase diagnostic accuracy and imaging throughput. 3,4 The attention has further increased due to the success of deep neural networks (DNNs) in image analysis tasks such as image classification, segmentation, and image restoration. 5,6 Although in hospitals most slides are stained with H&E and viewed under a brightfield (BF) microscope, advances in optics have greatly diversified the ways tissues can be examined in research labs. Techniques such as fluorescence microscopy, 7 multiphoton microscopy, 8,9 polarization and phase microscopy, [10][11][12][13] and chemical imaging 14,15 have become increasingly popular in tissue research because of their ability to provide additional information and context about cells and their environment. Scientific findings from research laboratories have been applied to solve clinical problems, with many of them proving to have diagnostic and prognostic applications. For example, tumor-associated collagen signatures, 16,17 a negatively prognostic biomarker defined by alterations in collagen orientation and deposition during tumor progression, were discovered using second-harmonic generation microscopy (SHG). 9 More complex optical modalities, such as fluorescence lifetime microscopy (FLIM) and Raman spectroscopy, have also emerged as powerful tools for studying cancer metabolism. 14,18,19 Despite the popularity of multimodal histopathology studies in research, the scales of these studies are often limited due to the high complexity and/or cost of the imaging instruments and the low throughput of the imaging experiments. Performing multimodal imaging can be timeconsuming for whole tissue sections at high image resolution. This scaling problem is especially problematic for modalities that involve point scanning and hardware parameter tuning. For example, scanning a whole tissue section using SHG at a resolution necessary to provide collagen fiber information could easily take several days to complete, all while needing to maintain the instrument's focus on the tissue sample. Reducing the complexity and increasing the throughput of multimodal imaging experiments for histopathology remains an open challenge. [20][21][22] From the advancements in open-source software developed for bioimage acquisition and analysis, image acquisition and analysis workflows can now be automated using scripts and executed efficiently to allow for batch acquisitions. For histopathological image analysis, tools such as QuPath, 23 Cytomine, 24 Histomicstk, 25 Fiji/ImageJ, 26 and CellProfiler 27 are frequently used for image visualization, segmentation, and cell quantification. To integrate run-time analysis into the acquisition workflow, the open-source microscope control platform Micro-Manager 28 provides an interface to call Fiji/ImageJ during the acquisition, which provides a basis for building integrated acquisition-analysis workflows. However, Fiji/ImageJ is not designed to handle pyramidal WSI (which can have dimensions up to 100; 000 × 100; 000 pixels) and leads to suboptimal performance or even failure to load the images at all. Moreover, workflows scripted in the ImageJ-Micro-Manager environment are often limited to simple workflows with insufficient flexibility, such as the lack of machine learning libraries and access to graphics processing units. Integration of the latest tools and models in computer vision and machine learning, such as DNNs, remains challenging.
Leveraging machine learning in microscopy automation is a promising solution that enables the design of sophisticated imaging experiments and improve experimental throughput. Conrad et al. 29 developed a LabVIEW-based intelligent acquisition system empowered by a machine learning model that automatically triggered high-resolution (HR) scans based on cell events, such as mitosis, detected in low-resolution (LR) screening. This automation minimized the time-intensive work that the researchers otherwise have had to spend on low-level observations and allowed them to concentrate on high-level experiment design and data analysis. Complex and flexible acquisition workflows with run-time analysis capabilities can also be achieved in the Konstanz Information Miner (KNIME), 30 a powerful analytic platform that enables the integration of image analysis and microscope control in a closed loop. However, both LabVIEW and KNIME lack both robust support for the large variety of microscopy devices available and the ability to handle WSI datasets. The recently developed open-source Python package Pycro-Manager 31 allows for complex acquisition and analysis pipelines that involve microscope control, data acquisition, and data analysis to be programmed in a single Python environment. Moreover, acquisitions with run-time analysis and feedback loops can be built and customized, utilizing the power of a wide variety of data analysis Python packages, including popular deep learning (DL) platforms, such as PyTorch and Tensorflow. In addition, platforms developed for handling and/or analyzing WSI, such as OpenSlide 32 and QuPath, 23 could be utilized in Python via their Python or command line interfaces. Combining the flexibility of the Python programming environment and the ability to access Micro-Manager Java libraries make Pycro-Manager a suitable tool for building an open-source automatic multimodal histopathological imaging system.
In this paper, we present a hardware-software co-designed imaging system for highthroughput multimodal histopathology. The optical path is designed with two coupled light paths for BF and laser scanning microscopy (LSM) that allows for switching between these two modalities using a simple shutter control. The system is equipped with a three-axis motorized stage, a motorized condenser, and a motorized dual objective slider. The motorized components can be adjusted via the software to form optimal configurations for different combinations of magnifications and modalities. The acquisition programs of our system are written in Python using Pycro-Manager, presented in Jupyter Notebook. Our system has a trainable DL-based detection model that automatically detects targeted regions in a low-magnification rapid scan and switches to higher magnification for BF or LSM acquisition at targeted regions. Annotations can also be made manually and imported from QuPath, 23 where the region of interests (ROIs) can be marked and translated to the stage coordinate system. During acquisition, in addition to standard run-time image correction algorithms, such as software autofocus and white balance, DL-based image restoration models can be applied to LSM images, including denoising and resolution enhancement. Evaluation of the DL-based image restoration methods on a pancreatic tissue microarray (TMA) slides set shows that the denoising and resolution enhancement model can improve the image signal-to-noise ratio (SNR) and image resolution, leading to shorter scan times while maintaining the necessary image quality for downstream image analysis. With selective acquisitions and run-time image restoration integrated into a single acquisition-analysis workflow, the system can considerably improve the throughput and repeatability of multimodal imaging for histopathology.

Methods
The system consists of three major components; a Pycro-Manager-based acquisition program written in Jupyter Notebook, a SHG-BF coupled microscopy system, and DL-based image analysis modules built using PyTorch. The overall workflow is illustrated in Fig. 1. Once the sample is mounted on the stage, the system first captures a BF scan of the whole slide area at low magnification and produces a stitched pyramidal OME.TIFF file. 33 Annotations can then be generated in two ways: 1) manual annotations are made in QuPath (v0.3.2) and exported as lists of coordinates within a CSV file or 2) Annotations are automatically generated by a DL model and exported as coordinate files. The coordinates files are then passed to the acquisition program and converted into position lists for selective acquisitions at high magnification, first for BF including an autofocus step and then for LSM. During the LSM acquisition, two types of DL-based image enhancement models, self-supervised denoising (SSD) and single-image superresolution (SISR), are enabled to improve the apparent quality of the scanning results. The system will be explained in more detail in the following sections.

Software Structure and Acquisition Workflow
The acquisition program is built on Pycro-Manager (v0.18.1). Pycro-Manager is a Python package that communicates with Micro-Manager, which handles hardware control via Micro-Manager's device adaptors, such as the laser scanning module OpenScan. The fast data transfer and translation layer between Python and Java in Pycro-Manager allows Java libraries to be called as if they were in Python. Because Micro-Manager already supports a large variety of microscopy hardware, integrating Pycro-Manager allows for customized acquisitions to be programmed in Python, enabling analytical Python packages to be easily inserted into the acquisition pipeline in a single Python environment.
The acquisition program is written in Jupyter Notebook. The program first deploys a fast prescan at low magnification (e.g., 4×). A position list covering the whole slide area is automatically generated according to the slide size, field of view size, and image pixel size. After the acquisition, the tiles are stitched according to the position list using the grid/collection plugin 34 from ImageJ 26 accessed from Python via PyImageJ. During the acquisition, software autofocus is applied (see Appendix B for details of the autofocus algorithm), and the focus position is recorded for each position. This is necessary because a tissue section mounted on a glass slide can have variations in terrain heights, which leads to varying focal planes at different locations. Background tiles are automatically recorded and averaged to produce a background image, which is used for white balancing and illumination correction. This scan can be done relatively quickly because of the large field of view (FOV) of each frame at low magnification. A 4× magnification objective is used for the low-magnification scan because it provides a good balance between the resolution of details needed to identify some tissue features and a wide FOV. The stitched prescan image is then exported in a BioFormats compatible pyramidal format (OME TIFF) with slide metadata (e.g., XY pixel size, objective magnification) by calling QuPath via its command line interface from Jupyter Notebook.
The user can then make annotations by opening the low-magnification prescan image in QuPath. A QuPath script is used to generate acquisition tiles, shown in an overlay on the image, covering the annotated areas according to the FOV size at 20× magnification for BF and LSM. Alternatively, the annotations can be generated automatically by DL-based detection models (details covered in the following section). The acquisition program then uses the annotation(s) to perform a selective acquisition at a higher magnification with BF and LSM using the positions saved in the files. The LSM control is handled by an in-house software-OpenScan, 35 a Micro-Manager device library that handles the generation of scanning waveforms, the control of scanning galvanometer, the scanning resolution, and the photomultiplier tube (PMT). The interactions of the software used in our system are illustrated in Fig. 2.
The focus of each tile at 20× BF is first roughly determined by interpolating the focus map recorded in the 4× prescan and then fine-tuned using the same autofocus algorithm with a narrower searching range during the 20× BF acquisition. The focus of each tile is again recorded and the focus map is interpolated and sampled to generate a more finely detailed focus map to be used for the LSM modality (the 20× LSM and BF FOV sizes can be slightly different, and there will be a Z offset between modalities). For multiphoton LSM or SHG, a z-stack is often beneficial due to the optical sectioning property of multiphoton microscopy resulting in incomplete coverage of the full depth of the tissue. 9 The step size and range of the z-stack can be configured according to the sample thickness; the following experiments used a 5-μm step size with three steps. With the center of the tissue at each location roughly determined in the 20× BF scan, the number of slices in the z-stack of LSM can be reduced such that it only covers the depth around the center of the tissue at each location.

Multimodal Imaging System
The imaging system is designed to couple BF and multiphoton LSM light paths with the ability to switch between the two modalities via simple shutter control. A Tsunami Ti:Sapphire laser (Spectra-Physics, Santa Clara, California) tuned to 800 nm, with a pulse length of ∼100 fs, is directed through a Pockels cell (ConOptics, Danbury, Connecticut), half and quarter waveplates (ThorLabs, Newton, New Jersey), a beam expander (ThorLabs), a 3-mm galvanometer driven mirror pair (Cambridge, Bedford, Massachusetts), a scan (f ¼ 75 mm) and tube (f ¼ 250 mm) lens pair (ThorLabs), and a dichroic mirror (Semrock, Rochester, New York) and is focused by a 20 × ∕0.75 NA air objective (Nikon, Melville, New York). SHG light is collected in the forward direction with a variable NA condenser (Olympus, Lombard, Illinois) and filtered with a 680-nm short-pass filter (Semrock) and an interference filter centered at 400 nm with a full-width at half-maximum bandwidth of 10 nm (ThorLabs). The back aperture of the condenser lens is focused onto a H7422-40P GaAsP PMT (Hamamatsu, Hamamatsu, Japan) using a secondary collection lens (f ¼ 150 mm, ThorLabs). The signal from the PMT is amplified with a C7319 integrating amplifier (Hamamatsu) and sampled with an analog data acquisition device NI PIXe-6356 (National Instruments, Austin, Texas). The galvanometer is controlled by an analog data output device NI PXIe-6738 (National Instruments, Austin, Texas). Timing among the galvo scanners, signal acquisition, and motorized stage positioning is achieved using our custom software called OpenScan in conjunction with the open-source software Micro-Manager (v2.0). 28 The Rapid Automated Modular Microscope system (Applied Scientific Instrumentation, Eugene, Oregon) serves as our microscope base, and ASI motorized translation stages are used for x, y, and z motion control. BF images are captured with the same system using a MCWHL2 white LED lamp (ThorLabs) and both the 20 × ∕0.75 NA objective and a 4 × ∕0.13 NA objective (Nikon). White light from this lamp travels through the condenser directed by a short-pass dichroic mirror with a cutoff at 670 nm (Semrock). The white light is passed through the first dichroic, focused on an RGB camera (QICAM Fast 1394, Qimaging, Surrey, BC, Canada) with a collection lens (f ¼ 230 mm, ThorLabs). The image data transfer is handled by OpenScan/ Micro-Manager. An objective slider (ASI) is used for easily switching between 4× and 20× objectives and a z-oriented motorized arm (f-stage, ASI) for optimizing the amount of light collected by the condenser for each imaging modality and objective position. All components of the laser light focusing and collection are contained in a blackout box as shown by the dashed line in Fig. 3.

Automatic ROIs Detection with Deep Neural Networks
In addition to acquiring annotated regions defined by user-entered annotations, our system can generate annotations automatically via DL-based detection models.
Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in many vision tasks and have been extensively used in computational histopathology for tasks such as tumor detection, gland segmentation, cell segmentation, and cell classification. [36][37][38][39] A CNN differs from traditional machine learning methods in that the feature extractor is parameterized by layers of convolutions with learnable filters and nonlinear functions. The parameters of the filters are learned during the training, whereas most of the traditional machine Fig. 3 Optical schematic of the laser and BF light path. The 800-nm laser light is passed through an electro-optic modulator (Pockels Cell) and a half-wave plate followed by a quarter-wave plate (λ∕2, λ∕4). The polarized light is focused by a 75-mm focal length scan lens and a 250-mm focal length tube lens to the back aperture of a Nikon 20 × ∕0.75 NA objective lens. Light from the sample is collected by an adjustable NA condenser lens and passed through a 670-nm dichroic mirror (FF670), a 680-nm short-pass filter (680 SP), and a 400-nm interference filter (400/10) and is focused by a 150-mm focal length collection lens to a Hamamatsu 7422-40P photomultiplier tube (PMT). White light is generated by an LED lamp and passed through the condenser and the sample and collected by either a Nikon 4 × ∕0.13 NA objective or the 20× objective. The light passes through a 720-nm dichroic mirror (FF 720) and is focused by a 230-mm focal length tube lens onto a Qimaging QICAM Fast 1394 camera (RGB Camera).
learning models for vision tasks use handcrafted features, such as SIFT. 40 With a sufficient amount of training data available, CNNs usually outperform traditional machine learning models and show better generalizability to unseen data. In our design, we use ResNet 41 as the CNN backbone for building the detector. ResNet uses residual connections that allow for more convolution layers to be used in the network without suffering from gradient vanishing, and it can alleviate overfitting by providing "shortcuts" for certain information to skip redundant convolutional layers. The training settings can be found in Appendix A.1 The trained CNN are enabled in the acquisition workflow. The low-magnification scan is processed by the CNN, and the coordinates of positive areas are generated and saved in a position list. The position list is then used by the acquisition program for high magnification BF and SHG imaging.

Deep Learning-Based Run-Time Image Restoration
Performing laser scanning imaging, such as SHG, can be time-consuming due to the point-wise scanning only collecting a single pixel at a time. The imaging time is further extended if a z-stack is required to cover the sample thickness. A complete scan of a tissue section could take several days to complete. Thus solutions that enable faster scanning while maintaining image quality are desirable. The image quality is mainly determined by the SNR and scanning resolution. Generally, the SNR of LSM is proportional to the square root of the number of photons (which is proportional to the dwelling time of each scanning point). The scanning resolution of the images is determined by the number of scanning points along each dimension (apart from optical resolution determined by the NA of the objective). Ideally, the scanning resolution should match the optical resolution such that Nyquist sampling for the optical signals is met. This means that a lower scanning resolution might undersample the optical signals and a higher scanning resolution, i.e., more scanning points along each dimension, would be necessary to increase the image resolution. Due to LSM deploying point-to-point scanning, the scanning time increases quadratically with the scanning resolution. Fortunately, recent advances in DL-based image restoration and enhancement techniques have brought new solutions to address this problem. In our system, two types of DL-based image enhancement models are implemented to shorten the scanning time while preserving the image quality, including an SSD model and a SISR model. With a successfully trained model enabled during the acquisition, the scanning procedures can be performed an order of magnitude faster while maintaining image quality comparable to that of the original scan.

Self-supervised denoising model
The signals that form an image are considered to be x ¼ s þ n, where s and n are drawn from two the joint distributions: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 6 ; 2 7 8 pðs; nÞ ¼ pðsÞpðnjsÞ: (1) Within a given radius R of i, the signal value s i conditionally depends on the observation of signals at other locations within the radius (i.e., pðs i Þ ≠ pðs i js j Þ s. t. fs j ∈ R; j ≠ ig) but is independent of the noise at other locations (i.e., pðs i Þ ¼ pðs i jn j Þ s. t. fn j ∈ R; j ≠ ig), meaning that the distribution of the noise conditioning on the signal factorizes as Under these assumptions, the expectation of the pixel value at i is estimated by a blind spot network that observes the surrounding region of the pixel i. 42 Because E½x i ¼ E½s i þ E½n i ¼ E½s i (assuming the noise is zero mean, i.e., E½n i ¼ 0), the estimation is free of noise. 42 This suggests that the model can estimate the clean image even without observing clean images in the training phase. Such methods are known as self-supervised learning (SSl) denoising. 43 Models such as Noise2Void have been successfully deployed for denoising microscopic data and tested for the cases of Poisson noise and Gaussian noise. 42 We used the same idea but modified the original implementation of Noise2Void by making use of random dropout and stochastic forward pass similar to Monte Carlo dropout. 44 In the original training process of Noise2Void, subregions are randomly cropped from the image. In each subregion, a randomly selected pixel is replaced by another randomly selected pixel. The network is then trained to predict the original pixel value at the replaced location. In the inference, the predictions are then obtained at each location using a sliding window. In our training scheme, we applied random dropout on the input image with a small dropout rate p that creates random blind spots in the input image. The neural network was then trained to predict the missing values at the blind spots. The loss was only evaluated at the blind spot locations to prevent the network from overfitting, which could result in an identity mapping. At inference, a number of k dropout operations with the same dropout rate p were applied to the input image, and k outputs were obtained. Only the pixel values at the blind spot locations in each output were kept. The final output was then computed by averaging the k outputs, similar to the stochastic forward pass used in Bayesian neural networks. We used small values <0.1 for the dropout rate p to ensure that the created blind spots were sparse enough such that each blind spot had enough surrounding pixels observed by the network to achieve a good estimation of the missing pixels. Also the values of k need to be set inversely proportional to p such that the probability of creating a blind spot at each pixel location at least once is close to 1. The training and inference schemes are illustrated in Fig. 4.

Supervised single-image super-resolution model
The goal of SISR models is to estimate the HR image from a given LR image. Let l and h denote an LR image and its corresponding HR image, respectively. The relation between l and h can be written as l ¼ h Ã f , where Ã denotes a convolution operation and f denotes the blurring operator. In microscopy, the inverse problem l Ã f −1 ¼ h can be intractable due to the difficulty of finding the true point spread function (blurring operator) and solving the deconvolution itself. 45 Recently, CNN-based solutions have been proposed to estimate the inverse operation and have achieved state-of-the-art performances. 46 Most of the CNN-based solutions rely on training on a large amount of LR-HR image pairs, with the difference measurements between the network outputs and ground-truth HR images being minimized. Recently, generative adversarial networks (GANs) have been introduced to enable the CNN to generate HR images with higher visual quality. 47 Moreover, similar example-based networks can also be applied to the image denoising problem, with the network being trained on noisy-clean image pairs. 48,49 Details of the training specifics can be found in Appendix A.2. The CNN backbone of the models is similar to the image-to-image translation model used in Ref. 50, but with only one input channel (Fig. 5). Although the denoising model can be trained solely using noisy images, the SISR model needs LR/HR image pairs to train. For the SISR model, the input images can be collected at both a low scanning resolution and fast scanning rate (fewer pixels along each dimension and a lower SNR), whereas the "ground truth" images can be acquired at a high scanning resolution with a slow scan rate (more pixels along each dimension and a high SNR). The SISR model can then achieve SISR and denoising simultaneously. Once the models are trained, the user can enable the models during the LSM acquisition for run-time image resolution enhancement and denoising.
To enable the generation of HR and high-SNR images with sharper details and better perceptual quality, we incorporate GANs and perceptual loss for training the image-to-image translation network. GANs have been used in many inverse problems, such as SISR, 51 image inpainting, 52 and style transfer. 53 The discriminator of the implemented GAN consists of five convolutional layers, each followed by a ReLU activation function. The output of the last convolutional layer is processed by an average-pooling layer and fed to a linear layer that produces a single prediction. The discriminator serves as a surrogate to push the generator (the image-toimage translation network) to generate outputs that are indistinguishable by the discriminator.
In addition to GANs, another frequently used approach to increase the visual quality of the generated images is perceptual loss. 54 Perceptual loss is usually measured by a pretrained CNN in which the features extracted by the CNN are compared between the generated image and the ground truth image. The CNN is pretrained on a third-party dataset (usually ImageNet), and the weights are frozen during the training process of the image-to-image translation network. Thus the features extracted by the CNN can be treated as static latent representations of the input images with structural and semantic meanings. Measuring the distances between the generated image and ground truth image in the latent space provides complementary information that is not available from a pixel-wise loss, such as mean-absolute error (L1 loss) between pair of pixels in the two images. In our implementation, we used ImageNet pretrained VGG16 to extract latent features, 55 as suggested in Ref. 54. Consider a dataset D ¼ fx i ; y i g N i¼1 , where N equals the number of input-target image pairs. The optimization objective of network training is written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 1 5 6 where G is the generator parameterized by an image-to-image translation network, D is the discriminator described above, and V VGG is a pretrained VGG16 network. λ p is the weight for the perceptual loss, and λ g is the weight for the adversarial loss. The resulting training pipeline that optimizes all three kinds of loss functions is illustrated in Fig. 6. Similar concepts have been implemented for autofluorescence-harmonic microscopy. 56 Our major contribution lies in integrating the methods into a run-time analysis workflow along with the acquisition. This system would also facilitate the collection of paired image data of different resolutions and different modalities for training such DL models. We evaluated the efficacy of the image-to-image translation network for denoising and resolution enhancement of SHG images, with results presented in the next section.
The TMAs contain a total of 769 TMA cores of pancreatic cancer (pancreatic ductal adenocarcinoma), chronic pancreatitis, and normal pancreas tissue. Cores were generally 2 mm in diameter and 4-μm thick. PA2081b, containing a mixture of the three types of cores, was used to collect data with annotations generated automatically by the machine learning detector.
This TMA slide was also used for testing the run-time LSM image denoising and resolution enhance models. The rest of the slides were used for collecting BF and SHG data from manually drawn annotations and the data were used to train the DNNs. Further details regarding each TMA slide can be found in Ref. 57. The hardware of the machine used to control the imaging system is as follows: CPU: Intel ® Core™ i9-10900X CPU 3.70 GHz (10 cores), RAM: 32 GB, GPU: NVIDIA RTX A4000 (16 GB).

Datasets for machine learning tumor region detector
TMA cores were separated into two categories, 459 malignant (pancreatic cancer) and 310 benign cores (chronic pancreatitis and normal pancreas tissue). The BF images captured at 4× were used for training and validation after being divided into patches with a size of 224 × 224 without overlap.
Low saturation patches (mean saturation <0.15) were discarded to exclude background patches, with saturation defined as per the hue, saturation, brightness color model. 80% of the cores (18,631 patches) were used for training, and 20% of the cores (4658 patches) were used for validation.

Datasets for run-time image restoration
Slides BBS14011, PA2072, and PA961e were used to collect BF images and SHG images at 20×.
SHG images were collected using two settings. (1) Low-quality setting: scan rate at 500,000 Hz, scanning resolution at 256 × 256 (pixel size 0.509 μm per pixel). (2) High-quality setting: scan rate at 100,000 Hz, scanning resolution at 512 × 512 (pixel size 0.255 μm per pixel). The three TMA slides yielded 28,464 tiles, each containing a z-stack of five slices, excluding low signal tiles (mean pixel value <0.1). 20% of the tiles were split for validation during the training to prevent overfitting. The trained network was then used to perform prediction on the testing TMA slide (PA2081b) as a part of the automatic acquisition workflow. The z-stack step size was 4 μm.
The total z-stack range covered small variations in the tissue surface. In training, each slice of a z-stack tile was treated as an input image.

Manual annotation
Once the slide is mounted on the stage, the rapid 4× scan function is executed. Image tiles are automatically white-balanced, flat-field corrected, stitched, and exported as a pyramidal OME.TIFF 33,58 files. To make annotations manually, the user opens the 4× image in QuPath and draw annotations with QuPath annotation tools. Annotations with arbitrary closed shapes are supported. The system uses a set of QuPath scripts (written in Groovy) that can convert the annotations into stage position lists that can be read by the acquisition program. The position list is automatically passed to the acquisition program when the 20× and SHG acquisition workflows are executed. Multiple position lists are generated and executed in a loop for multiple annotations on a slide. The acquisition procedure and results of this functionality are shown in Fig. 7.

Automatic annotation by machine learning
Once the rapid scan at 4× finishes, the tiles can be examined by a machine learning model. For the training of the supervised learning model, all patches extracted from a malignant core were considered positive, and all patches from a benign core were considered negative. The classification scores of the patches were averaged for each core to reach a core-level prediction. After training, the supervised model achieved an accuracy of 94.5% in differentiating pancreatic cancer cores from benign pancreas cores in the validation set. At acquisition, the model processes the tiles, makes predictions on the nonempty tiles (mean saturation >0.15), and generates a position list containing the locations of the positive areas. The position list is then imported to QuPath and displayed as annotations by executing a QuPath script. The detection model successfully classified 175 cores (out of 192 cores) on the testing TMA, with an accuracy of 91.1% and a recall of 0.942. The downstream acquisition at 20× for BF and SHG were then performed using the position list, as shown in Fig. 7.

Run-Time Image Restoration
The trained denoising model and resolution enhancement model are enabled during the acquisition by specifying them as the image processing hook function in the Pycro-Anager Acquisition API. The image processing hook function provides access to the data-stream as data is being acquired. This allows the data to be modified, analyzed, and diverted to customized visualization and saving on-the-fly. For evaluating the SSD model and SISR model, the models are enabled with a fast scanning pixel rate of 0.5 MHz at a 256 × 256 scanning resolution. The processed images are compared with the ground truth images collected using a slow scanning pixel rate of 0.1 MHz at a 512 × 512 scanning resolution. Note that the same scanning rates and scanning resolutions are used to collect the training data for model training. The SSD model is trained using SHG images collected at 0.5 MHz at a 256 × 256 scanning resolution, whereas the SISR model is trained to map SHG images collected at 0.1 MHz at a 256 × 256 scanning resolution to images collected at 0.1 MHz at a 512 × 512 scanning resolution.
Four metrics are used to evaluate the similarity between the network outputs and the ground truth: peak-signal-to-noise ratio (PSNR), structural similarity (SSIM), 59 Fréchet inception distance (FID), 60 and CurveAlign collagen fiber statistics. 61 PSNR is a pixel-wise metric that measures the pair-wise distances of pixels in the same locations of two images. SSIM and FID are perception-based metrics that quantify the visual similarity between images based on measurements derived from the whole images or regions instead of individual pixels. SSIM follows an explicitly defined formula, whereas FID is based on features computed from pretrained Inception networks. 62 FID is shown to have the ability to produce measurements that align well with visual observations of human, and it is currently the standard metric to assess the quality of images generated by GANs. 63 CurveAlign is a popular tool box for characterizing collagen fiber topography, such as measuring collagen fiber density, and fiber alignment coefficients. 61 We computed the absolute differences between the collagen fiber alignment coefficient and collagen fiber density calculated by CurveAlign between the processed image and ground truth image. The absolute difference for each image is than divided by the maximum value of alignment coefficient and density in the testing set, respectively, resulting in absolute error ratios for alignment coefficient and density. The ratios are then averaged across the testing set. This metric can be seen as another type of image-level metric with collagen-specific domain knowledge, selected due to collagen structure being the biologically relevant focus of the SHG imaging.
The outputs of the supervised method (SISR network) and self-supervised method are compared with several traditional denoising methods, including median filter, wavelet-based denoising, 64 and total variation (TV)-based denoising. 65 After performing denoising, the resulting images are upscaled using bicubic interpolation to match the size of the ground truth images collected using a slower scanning rate and a higher scanning resolution. Note that the SISR network performs denoising and upscaling simultaneously in a single network and bicubic upscaling is not needed. Details of the quantitative evaluation of the outputs are summarized in Table 1. Some representative outputs generated by the baselines and DL-based methods are shown in Fig. 8.
The evaluation results show that for SHG image denoising, SSD compares favorably to traditional methods. For the SISR model, the network processed images better resemble the ground truth images with better PSNR, SSIM, and FID score and more similar collagen fiber alignment coefficient and density. The results also suggest that the addition of GAN and perceptual loss increases the visual quality of the generated images, with a smaller FID score and a higher SSIM.
The computation time of the SISR model is around 0.2 s∕frame, and the scanning time for each frame is reduced from ∼3 to ∼0.15 s (a breakdown of run times can be found in Appendix C, Table 2). The superior performance in image restoration quality and the resulting acquisition acceleration demonstrate the high efficacy of DL-based methods for run-time LSM image processing. Bicubic: use bicubic upsampling on the noisy images. Median + bicubic: use median filter (radius = 1.5) on the noisy images and upsample the outputs using bicubic upsampling. Wavelet + Bicubic: use wavelet-based denoising on the noisy images and upsample the outputs using bicubic upsampling. TV + bicubic: use TV-based denoising on the noisy images and upsample the outputs using bicubic upsampling. Self-supervised + bicubic: upsample the outputs of the SSD network using bicubic upsampling. SISR w/o GAN and perceptual loss: the outputs of the SISR network without using GAN and perceptual loss. SISR with GAN and perceptual loss: the outputs of the SISR network with GAN and perceptual loss enabled during the training. For PSNR and SSIM, larger values suggest a higher similarity between the outputs and the ground truth. SSIM has a value range from 0 to 1, and the value reaches 1 if two images are identical. For FID, alignment error, and density error, smaller values suggest a higher similarity between the outputs and the ground truth.

Conclusion and Future Work
In this paper, we presented a hardware-software co-design for automatic and reproducible multimodal imaging of histological slides. The system integrates the acquisition and analysis in a single workflow that minimizes human intervention and improves the throughput for complex imaging experiments. The Python-based Pycro-Manager bridges the gap between the Python environment and the Java-based open-source microscope control software, Micro-Manager. Thus the controlling program accesses the existing Java libraries to communicate with the hardware device adaptors, including the laser scanning module OpenScan, that drives the microscope hardware, while making use of a large number of Python data analysis packages for building DL models and performing image processing. Together with the coupled BF and LSM light path, our system can switch between modalities and reach optimized configuration presets for each modality and magnification automatically by running Python scripts. QuPath is used for data visualization and manual annotations, after which the stage coordinates computed from the annotations are transferred back into the acquisition program for downstream selective acquisition. The DL-based detection model used in our system can automatically generate annotations for targets in the specimen and switch to point scanning mode for the targeted area. Moreover, our system benefits from run-time image enhancement for LSM that can improve the image SNR and image resolution, leading to shorter scan times while maintaining the necessary image quality for downstream analysis to answer biological questions. It should be noted that such image enhancement has limits, which depend on the details of the modality and the hardware of the system in question. A lower numerical aperture 4× objective, for example, might not generate sufficient information to accurately enhance an image to a final apparent magnification of 20× when performing super-resoution. Such limits should be testable by comparing HR images with paired restored LR counterparts.
It is also worthwhile mentioning that restoring clean/HR images from their noisy/LR counterparts is an under-determined problem, meaning that, for the input (e.g., an LR image), there exist multiple possible outputs (e.g., HR images that can be downsampled to match the same LR image). In this case, AI-generated outputs can be prone to blurriness because the model will tend to output an averaged result of all possible results. This effect will become more prominent as the resolution gap between the input and target output increases because there will be more Fig. 8 Denoising of SHG images: (a) noisy and LR image collected using a faster scanning rate (0.5 MHz) at a 256 × 256 scanning resolution; (b) median filter (radius = 1.5) followed by bicubic upscaling; (c) wavelet-based denoising followed by bicubic upscaling; (d) TV-based denoising followed by bicubic upscaling; (e) SSD followed by bicubic upscaling; (f) supervised denoising using the SISR network without GAN and perceptual loss; (g) supervised denoising using the SISR network with GAN and perceptual loss; and (h) clean and HR image collected using a slow scanning rate (0.1 MHz) at a 512 × 512 scanning resolution. possible HR outputs. Although adversarial loss and perceptual loss are useful for fighting off this effect to some extent, 1 this effect cannot be completely eliminated and sometimes can cause instability for adversarial network training due to the nature of under-determined problems.
Future work includes improving the speed of software autofocus by developing a DL-based one-shot autofocus method 66 for BF imaging and implementing support and denoising models for FLIM, an additional modality that is already supported by Micro-Manager and OpenScan.
Additionally, adapting this work to other system hardware, other systems, and other stains and tissue types would allow for an analysis of how generalizable the method is and how much additional data and training time will be necessary to apply these DL models to new data. For optimization, the workflow would need to integrate training data from the new system and adapt the existing model to the new system.

Detection Model
The CNN backbone used to train the detection models is ResNet18. 41 The optimizer is Adam 67 with a learning rate of 0.0002 and cosine annealing learning scheduler 68 without warm restart. The batch size for the supervised model is 64. The loss function is cross-entropy loss and binary cross-entropy loss, respectively. The number of training epochs is 100.

Runtime Image Enhancement Model
The image-to-image translation CNN backbone is described in Ref. 50, without the discriminator network, with the input and output channel being set to 1 for grayscale LSM images. The optimizer is Adam 67 with a learning rate of 0.0002 and cosine annealing learning scheduler 68 without warm restart. The batch size is 32 and the loss function is mean square error. The number of training epoch is 100. The input dropout rate is p ¼ 0.1 and the number of passes of the stochastic forward pass at inference is 1∕p × 64.

Appendix B: Software Autofocus Algorithm
The software autofocus algorithm (Algorithm 1) that runs during brightfield image acquisitions of tissue sections: Apply canny filter to the image; Sum the filtering result to produce a focus score; Store the focus score in array A;

End
Apply cubic interpolation to A and obtain the upsampled array; Find the corresponding z position value with the maximum focus score in A;